Diffusion models

Author

Louie John M. Rubio

Code
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Notes for Physics 312 class readings

General overview/ideas

Diffusion models are, like large language models, a type of neural network. However, they serve a completely different purpose: diffusion models are better known to the public as AI image generators. I personally first encountered OpenAI’s image generation system DALL-E 2, a model that generates images from text prompts.

Look at “A Shiba Inu dog wearing a beret and black turtleneck”, which I generated using DALL-E 2 around late 2022.

Figure 1: An image generated with DALL-E 2 using the prompt “A Shiba Inu dog wearing a beret and black turtleneck”

We’ll first look at a general overview of ideas shown in Anil Ananthaswamy’s post in Quanta Magazine: “The Physics Principle That Inspired Modern AI Art” [1].

  1. Generative models
  • The goal of generative models is to learn the probability distribution of a set of images and generate new data points that follow the original distribution.

  • Earlier models that produce realistic images include generative adversarial networks (GANs). However, they are hard to train.

  2. Probability distributions

The post illustrates how images can be treated as data points.

  • For a 2-pixel image, each pixel can be described by a number from 0 to 255. Each pixel is a dimension, so the image can be plotted as a point in 2D space.

The plot below shows a visual example of three 2-pixel images and their positions in 2D space.

Code
sns.set(font_scale = 1.5, style = "white")
seed = 72596
np.random.seed(seed)
im_array = np.rint((np.random.random((3,2))*255))

fig, ax = plt.subplots(1,2, figsize = (10, 5))

ax[0].matshow(im_array, cmap = 'gray')
ax[0].set_yticks([0,1, 2], ["First image", "Second image", "Third image"]);
ax[0].set_xticks([0, 1], ["Pixel 1", "Pixel 2"]);

ax[1].plot(im_array[:,0], im_array[:,1], ls = 'None', marker = '.')
ax[1].set_xlim(0, 255)
ax[1].set_ylim(0, 255)
ax[1].set_aspect(1) #create plot with equal aspect ratio
ax[1].set_title("Three 2 pixel images in 2D space")

Code
sns.set(font_scale = 1.5, style = "white")
seed = 72596

np.random.seed(seed)
im_array = np.rint((np.random.random((2000,2))*255))

fig, ax = plt.subplots(1,2, figsize = (10, 5))

ax[0].plot(im_array[:,0], im_array[:,1], ls = 'None', marker = '.')
ax[0].set_xlim(0, 255)
ax[0].set_ylim(0, 255) 
ax[0].set_title("2000 2-pixel images in 2D space")

sns.histplot(x = im_array[:,0], y = im_array[:,1], ax = ax[1], bins = 50)
ax[1].set_xlim(0, 255)
ax[1].set_ylim(0, 255)
ax[1].set_title("2D histogram = 50 bins")

  • Given multiple images (still 2 pixels each), one can bin them in 2D, creating a probability distribution.

The example above uses randomly generated 2-pixel images, so the peaks in the histogram are also random. This would not be the case for real image data.

  • This probability distribution can be used to generate new images. The generated images should follow the empirical distribution of pixel 1 and 2.

  • Extending this to bigger images, the dimensionality of the problem increases (since each pixel is a dimension). Sampling from the higher-dimensional distribution and laying the sampled pixel values out together can recreate an image.

  • GANs are hard to train because they sometimes fail to learn the full probability distribution of a set of images and can only generate a subset (the example given: a model trained on pictures of different animals sometimes generates only pictures of dogs).
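The binning-and-sampling idea above can be sketched in a few lines of NumPy (a minimal sketch; the bin count and the random "training" images are arbitrary choices, mirroring the plots above): build a 2D histogram of many 2-pixel images, normalize it into a probability distribution, then draw new images from it.

Code
```python
import numpy as np

rng = np.random.default_rng(72596)

# "Training set": 2000 random 2-pixel images (each pixel in 0..255).
images = np.rint(rng.random((2000, 2)) * 255)

# Bin the images in 2D to get an empirical probability distribution.
counts, x_edges, y_edges = np.histogram2d(
    images[:, 0], images[:, 1], bins=50, range=[[0, 256], [0, 256]]
)
probs = counts / counts.sum()  # normalize counts into probabilities

# "Generate" new 2-pixel images: sample a bin index according to the
# empirical distribution, then map the bin back to pixel values.
flat_idx = rng.choice(probs.size, size=10, p=probs.ravel())
ix, iy = np.unravel_index(flat_idx, probs.shape)
new_images = np.column_stack([x_edges[ix], y_edges[iy]])
print(new_images.shape)  # (10, 2): ten new 2-pixel images
```

For real images this direct histogram approach breaks down quickly: with one dimension per pixel, the number of bins explodes, which is why learned generative models are needed at all.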

  3. Diffusion models and non-equilibrium dynamics
  • Diffusion models are inspired by nonequilibrium thermodynamics, which describes how the probability distribution of a diffusing system evolves over time.

  • In the post, the example used is a drop of ink diffusing in a container.

    • Initially, the blue ink is localized in an area. To calculate the probability of finding an ink molecule in the container, a probability distribution that models the initial state is needed. (This kind of distribution is hard to sample from.)

    • After diffusing through the water, the ink molecules become more uniformly distributed throughout the water. This distribution is easier to express mathematically.
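The ink picture can be simulated as a random walk (a sketch; the molecule count and step size are made-up values): molecules start at a single point and spread out over time, with the spread growing like the square root of time, the classic diffusion signature.

Code
```python
import numpy as np

rng = np.random.default_rng(0)

n_molecules, n_steps = 5000, 200
positions = np.zeros(n_molecules)  # all "ink" starts localized at x = 0

spread = [positions.std()]
for _ in range(n_steps):
    positions += rng.normal(0.0, 1.0, n_molecules)  # each molecule takes a random step
    spread.append(positions.std())

# The initial, localized distribution is hard to sample from directly;
# after many steps it approaches a broad Gaussian, which is easy.
print(spread[0], spread[-1])  # 0.0 at the start, ~sqrt(200) ≈ 14 at the end
```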

  4. Sohl-Dickstein et al.: Deep Unsupervised Learning using Nonequilibrium Thermodynamics [2]
  • The generative modeling algorithm is first taught how to turn images into noise.

    • Given an image from the training set, noise is added to each pixel at every time step. Over time, the pixel values approach the distribution of the noise. (Forward process)
    • The neural network is then trained on the noisy images to predict the less noisy image from the previous step. (Reverse process)
  • This diffusion model was initially published by Jascha Sohl-Dickstein [2].

    • Main issue: the generated images were worse than those from GANs

    • The sampling process was also slow
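The forward process can be sketched directly (a sketch; the step count and noise scale are made up and are not the paper's actual schedule): damp the signal slightly and add Gaussian noise at every step, and the pixel values gradually forget the original image.

Code
```python
import numpy as np

rng = np.random.default_rng(1)

# A "clean image": 1000 pixels that all share the value 0.8.
x = np.full(1000, 0.8)
n_steps, s = 500, 0.1  # s controls how much noise is added per step

for _ in range(n_steps):
    # Variance-preserving forward step: shrink the signal, add Gaussian noise.
    x = np.sqrt(1.0 - s**2) * x + s * rng.normal(size=x.size)

# After many steps the pixels look like standard normal noise:
print(x.mean(), x.std())  # mean near 0, std near 1
```

The factor `sqrt(1 - s**2)` keeps the total variance from blowing up, so the pixel distribution converges to a fixed Gaussian rather than spreading forever.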

  5. Song et al.: Generative Modeling by Estimating Gradients of the Data Distribution [3]
  • Instead of estimating the probability distribution of the data directly, they used the gradient of the log of the data distribution (the score).

    • Perturb images in the training dataset with increasing levels of noise.
    • Train a neural network to estimate these gradients and use them to recover images from the noise.
  • Song worked on this without knowing about Sohl-Dickstein’s work.
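To see why the gradient of the log-density (the score) is useful, here is a minimal Langevin-dynamics sketch using a distribution whose score is known in closed form: a 1D Gaussian, where the score is -(x - μ)/σ². In Song and Ermon's actual method a neural network estimates this score for the data distribution; everything else here (step size, iteration count) is an illustrative choice.

Code
```python
import numpy as np

rng = np.random.default_rng(3)

mu, sigma = 5.0, 2.0

def score(x):
    # Score of N(mu, sigma^2): the gradient of log p(x) with respect to x.
    return -(x - mu) / sigma**2

# Langevin dynamics: repeatedly nudge samples along the score, plus noise.
x = rng.normal(size=10_000)  # start from pure standard normal noise
step = 0.1
for _ in range(2_000):
    x = x + 0.5 * step * score(x) + np.sqrt(step) * rng.normal(size=x.size)

print(x.mean(), x.std())  # approaches mu = 5 and sigma = 2
```

Knowing only the score of the target distribution, not the distribution itself, is enough to turn noise into samples.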

  6. Ho et al.: Denoising Diffusion Probabilistic Models (DDPM) [4]
  • Updated Sohl-Dickstein’s diffusion model with Song’s ideas

  • DDPMs matched or surpassed other generative models, including GANs (benchmark used: comparing the distribution of generated images to that of the training set)
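A useful detail from the DDPM paper [4]: with a noise schedule βₜ, the forward process has a closed form, xₜ = √ᾱₜ·x₀ + √(1-ᾱₜ)·ε with ᾱₜ = ∏ₛ(1-βₛ), so any noise level can be reached in a single jump instead of step by step. A sketch using the paper's linear schedule (βₜ from 10⁻⁴ to 0.02 over 1000 steps); the "image" here is just random pixel values.

Code
```python
import numpy as np

# Linear noise schedule from the DDPM paper: beta from 1e-4 to 0.02, T = 1000.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)  # fraction of signal variance kept at step t

rng = np.random.default_rng(4)
x0 = rng.random(256)  # a "clean image" of 256 pixels in [0, 1)

def q_sample(x0, t):
    # Closed-form forward process: jump straight from x0 to noise level t.
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

x_T = q_sample(x0, T - 1)
print(alpha_bar[0], alpha_bar[-1])  # ~1 at t=0 (almost clean), ~0 at t=T-1 (pure noise)
```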

  7. Modern diffusion models
  • More recent models all use a variation of DDPM.

  • Other models include training on text to allow generation of images based on text.

  8. Problems encountered
  • Models are prone to bias based on their training datasets.

  • Issue on ethics for scraped data (copyright, etc.)

Latent diffusion models and stable diffusion

Latent diffusion models (LDMs) were created with the goal of reducing the computational complexity of training and sampling without sacrificing the performance of diffusion models [5]. Instead of working in the pixel space, latent diffusion models are “made to learn a space that is perceptually equivalent to the image space¹, but offers reduced computational complexity” (which is called the latent space) [5].

Stable Diffusion (SD) builds upon the work done by Rombach et al. [5] and is a type of text-to-image LDM. Unlike other text-to-image models (like DALL-E 2), SD is open source and is trained on images from LAION, a non-profit that makes open-source AI models and datasets. Aside from generating images from text prompts, SD can be used for image modification (usually referred to as img2img), taking an image as input along with a prompt to guide the generation.

Testing image generation with Stable Diffusion

SD 2.1 has a demo available on HuggingFace for testing its outputs. Here’s an example of images generated with the prompt Paris in a rainy day.

(a) Four generations for Paris in a rainy day

(b) Using umbrella as a negative prompt

Figure 2: Images generated using Stable Diffusion 2.1’s demo on HuggingFace with the text prompt Paris in a rainy day, with and without negative prompts.

In Figure 2 (a), we see that two of the images include the Eiffel Tower, a famous landmark in Paris. Most of the images appear gray-ish and overcast, with some people in the scene walking with umbrellas. When umbrella is used as a negative prompt, no umbrellas appear in the scene, and interestingly, the scene appears more colorful (even though the ground looks flooded).

Since no seed is set, generations with the demo are still probabilistic, and checking how much the generated images were affected by the negative prompt would require multiple generations. Based on this blog post, it seems that negative prompts have more impact on newer SD models than on previous ones.

Regardless, the fact that it’s now possible to start from noise and generate an image that matches the prompt is still mind-boggling.

Real life problems concerning diffusion models

AI image generation is developing at a very rapid pace. That said, there are still issues that plague generative models in relation to the datasets they are trained on.

Biases in the dataset

Since diffusion models generate data that follow the distribution of the training set, any biases in the data can be reflected in the generated data. The end of the Quanta Magazine article [1] discusses how an avatar generator produced sexualized images of women but not of men.

Key takeaways

  • Diffusion models take a lot of inspiration from nonequilibrium thermodynamics.

  • The main goal of diffusion models is to create a sample that closely follows a modeled data distribution. A diffusion model achieves this by gradually adding noise to a distribution and learning how to remove this noise (or recreate a sample from the noise).

  • Latent diffusion models perform the diffusion in the latent space (like an image embedding), which has a lower dimension than the pixel space.

  • Many of the problems arising from the rapid advancement of diffusion models relate to the training data: the ethics behind the use of copyrighted data and possible harmful biases from the dataset.

References

[1]
A. Ananthaswamy, “The physics principle that inspired modern AI art,” Quanta Magazine, 2023.
[2]
J. Sohl-Dickstein, E. A. Weiss, N. Maheswaranathan, and S. Ganguli, “Deep unsupervised learning using nonequilibrium thermodynamics.” arXiv, 2015. doi: 10.48550/ARXIV.1503.03585.
[3]
Y. Song and S. Ermon, “Generative modeling by estimating gradients of the data distribution.” arXiv, 2019. doi: 10.48550/ARXIV.1907.05600.
[4]
J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models.” arXiv, 2020. doi: 10.48550/ARXIV.2006.11239.
[5]
R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models.” arXiv, 2021. doi: 10.48550/ARXIV.2112.10752.

Footnotes

  1. This sounds like what word embeddings are trying to do in LLMs, but for images.↩︎